During the data cleaning I excluded 64 rows that had 0 suicide number per year and also 7 countries that had 4 or less years of records.
suicides_year_country <- group_by(data, country, year) %>% summarise(suicides_per_yer=sum(suicides_no)) %>% arrange(suicides_per_yer)
nrow(suicides_year_country[suicides_year_country[,'suicides_per_yer']<1,])
num_year_records <- group_by(data, country) %>% summarise(n_rows=n(), years = n_rows/12) %>% arrange(years)
num_year_records[num_year_records['years']<=4,]
data <- group_by(data, country, year) %>% filter(sum(suicides_no)!=0)
data <- data %>% filter(!(country %in% head(num_year_records$country, 6)))
In this section we will explore the dataset in a global perspective. Our main interest will be how worldwide number of suicides developed through the years.
Let’s plot the number of suicides per 100k of global population:
group_by(data, year) %>% summarise(suicides_no = sum(suicides_no)/sum(population)*100000) %>%
ggplot(aes(x=year, y=suicides_no))+
geom_line()+
geom_point()+
scale_x_continuous(breaks = scales::pretty_breaks(n = x_year_ticks))+
labs(y='Suicides per 100k people')
In this plot we can see that the global suicide ratio was rasing from 1985 to 1995 and then started to steadily decrase.
Let’s have a look at suicide numbers in different demographic groups proportionally to the global population.
Data plots by gender and age groups.
suicides_no_per_year_pop <- group_by(data, year) %>% mutate(suicides_per_pop=suicides_no/sum(population)*100000)
group_by(suicides_no_per_year_pop, year, sex) %>% summarise(suicides_t_pop = sum(suicides_per_pop)) %>%
ggplot(aes(x=year, y=suicides_t_pop, colour=sex))+
geom_line()+
geom_point()+
scale_x_continuous(breaks = scales::pretty_breaks(n = x_year_ticks))+
labs(y='Suicides per 100k people')
group_by(suicides_no_per_year_pop, year, age) %>% summarise(suicides_t_pop = sum(suicides_per_pop)) %>%
ggplot(aes(x=year, y=suicides_t_pop, colour=age))+
geom_line()+
geom_point()+
scale_x_continuous(breaks = scales::pretty_breaks(n = x_year_ticks))+
labs(y='Suicides per 100k people')
group_by(suicides_no_per_year_pop, year, age, sex) %>% summarise(suicides_t_pop = sum(suicides_per_pop)) %>%
ggplot(aes(x=year, y=suicides_t_pop, colour=age))+
facet_grid(sex ~., scales = "free_y")+
geom_line()+
geom_point()+
labs(y='Suicides per 100k people')
In the graphs above we can see firstly that there is a big disproportion by gender, men commit suicides far more often than women. Then we see that w.r.t global population people in different age groups have different suicide rate. My assumption is that it really depends on a proportion of a population of a certain age, so let’s check how population is distributed by gender and age.
data_total_pop_year <- group_by(data, year) %>% mutate(total_pop_year = sum(population))
group_by(data_total_pop_year, year, sex) %>% mutate(pop_prop = population/total_pop_year) %>% summarise(pop_prop = sum(pop_prop))%>%
ggplot(aes(x=year, y=pop_prop, fill=sex))+
geom_bar(stat='identity')+
labs(y='Population proportion')
group_by(data_total_pop_year, year, age) %>% mutate(pop_prop = population/total_pop_year) %>% summarise(pop_prop = sum(pop_prop))%>%
ggplot(aes(x=year, y=pop_prop, fill=age))+
geom_bar(stat='identity')+
labs(y='Population proportion')
So, from plots above we can see that ratio of males and females is roughly the same, with a little favor towards females. From the second plot we can see that people in age of 75+ is the smallest group, then come people in groups 15-24 and 25-34, after that 5-14, then 55-74 and the biggest group is 35-54. So it’s not surprising that in plots of suicides ratio w.r.t global population the biggest ratio is among people in age 35-54, and the good thing is that despite the fact that children in age 5-14 is not the smallest group, their suicide ratio is quite small. Then the other groups behave more or less according to their proportion in global population, only it seems that the proportion of suicide ration of 75+ age group is a bit bigger than their population proportion.
Now let’s see how those numbers differ w.r.t population of a given demographic group (for example, we will dvide a total number of suicides commited by males in a given year and divide it by total global population of males in a given year and scale it by 100k):
Firstly, let’s explore number of suicides per population of males and femalеs:
group_by(data, year, sex) %>% summarise(suicides_no = sum(suicides_no)/sum(population)*100000) %>%
ggplot(aes(x=year, y=suicides_no, colour=sex))+
geom_line()+
geom_point()+
scale_x_continuous(breaks = scales::pretty_breaks(n = x_year_ticks))+
labs(y='Suicides per 100k people')
group_by(data, year, sex) %>% summarise(suicides_no = sum(suicides_no)/sum(population)*100000)%>%
ggplot(aes(x=sex, y=suicides_no, fill=sex))+
geom_boxplot()+
labs(x='Gender', y='Suicides per 100k people')+
theme(legend.position = 'none')
Again, since the proportion of population between males and females is approx. the same this plot looks quite the same as the previous one.
Let’s plot suicide ratio by age population
group_by(data, year, age) %>% summarise(suicides_no = sum(suicides_no)/sum(population)*100000) %>%
ggplot(aes(x=year, y=suicides_no, colour=age))+
geom_line()+
geom_point()+
scale_x_continuous(breaks = scales::pretty_breaks(n = x_year_ticks))+
labs(y='Suicides per 100k people')
group_by(data, age, year) %>% summarise(suicides_no = sum(suicides_no)/sum(population)*100000)%>%
ggplot(aes(x=age, y=suicides_no, fill=age))+
geom_boxplot()+
labs(x='Age', y='Suicides per 100k people')+
theme(legend.position = 'none')
So here we can see that the suicide ratio increases with the age.
Number of suicides by age/gender population:
group_by(data, year, age, sex) %>% summarise(suicides_no = sum(suicides_no)/sum(population)*100000) %>%
ggplot(aes(x=year, y=suicides_no, colour=age))+
facet_grid(sex ~ ., scales = 'free_y')+
geom_line()+
geom_point()+
scale_x_continuous(breaks = scales::pretty_breaks(n = 6))+
labs(y='Suicides per 100k people')
group_by(data, year, sex, age) %>% summarise(suicides_no = sum(suicides_no)/sum(population)*100000)%>%
ggplot(aes(x=sex, y=suicides_no, fill=age))+
geom_boxplot()+
labs(x='Gender', y='Suicides per 100k people')
Let’s start by ploting a graph of suicide development by continents
group_by(data, year, continent) %>% summarise(cont_pop=sum(population), s_no=sum(suicides_no), s_no_per_p = sum(suicides_no)/sum(population)*100000) %>%
ggplot(aes(x=year, y=s_no_per_p, colour = continent))+
facet_grid(continent ~ ., scales = "free_y") +
geom_line()+
geom_point()+
scale_x_continuous(breaks = scales::pretty_breaks(n = 6))+
theme(legend.position = 'none')+
labs(y='Suicides per 100k')
Firstly, it’s worth to mention that thera are only 3 countries present for Africa (South Africa, Mauritius and Seychelles). So the data for this continent is not quite representative. Also, there are only 4 countries for Oceania (Australia, Fiji, Kiribati and New Zeland), however for this continent those countries account for the majority of population: Australia 59%, New Zeland 11.5%, Fiji 2.18%, Kiribati 0.28% according to this Wikipedia article.
Ok, so we can see a good trend in Europe that suicide rate is steadily decreasing starting from 1995. But for Oceania and Americas the trend is quite worrying showing an increase in suicide rate. And for Asia it looks like the rate oscilates somwhere around the mean.
group_by(data, year, continent) %>% summarise(suicides_per_100 = sum(suicides_no)/sum(population)*100000)%>%
ggplot(aes(x=continent, y=suicides_per_100, fill=continent))+
geom_boxplot()+
theme(legend.position = 'none')+
labs(x='Continent', y='Suicides per 100k people')
group_by(data, year, sex, continent) %>% summarise(suicides_per_100 = sum(suicides_no)/sum(population)*100000)%>%
ggplot(aes(x=continent, y=suicides_per_100, fill=sex))+
geom_boxplot()+
labs(x='Continent', y='Suicides per 100k people')
group_by(data, year, age, continent) %>% summarise(suicides_per_100 = sum(suicides_no)/sum(population)*100000)%>%
ggplot(aes(x=continent, y=suicides_per_100, fill=age))+
geom_boxplot()+
labs(x='Continent', y='Suicides per 100k people')
From the plots above we can clearly see that the biggest suicide ratio is in Europe followed by Asia, Oceania and Americas. Suicide ratio by gender again not in favor of men as we’ve seen in global statistics. Suicide ratio by age is intresting in Oceania - it dosn’t follow the global trend, that suicide ratio increases with age, but we can see that the ratio of age group 25-34 and suicide ratio of group 15-24 is higher than that of group 55-74.
Let’s have a look on average statistics by countries. Down below I plot average suicide rate by countries, proprtion of suicide by age and gender grups.
group_by(data, country, continent) %>% summarise(suicides_per_100k = sum(suicides_no)/sum(population)*100000)%>%
ggplot(aes(x=reorder(country, suicides_per_100k), y=suicides_per_100k, fill=continent))+
geom_bar(stat='identity')+
coord_flip()
group_by(data, country, age) %>% summarise(suicides_per_100k = sum(suicides_no)/sum(population)*100000)%>%mutate(perc=suicides_per_100k/sum(suicides_per_100k))%>%
ggplot(aes(x=country, y=perc, fill=age))+
geom_bar(stat="identity")+
coord_flip()
group_by(data, country, sex) %>% summarise(suicides_per_100k = sum(suicides_no)/sum(population)*100000)%>%mutate(perc=suicides_per_100k/sum(suicides_per_100k))%>%
ggplot(aes(x=country, y=perc, fill=sex))+
geom_bar(stat='identity')+
coord_flip()
So we can see few things here, firstly is that as expected from previous analysis European countries are among the countries with high suicide rate. It’s also looks like among the European countries those that belong to Eastern Europe have higher rate of suicide. It would be actually intresting to have a finer granularity of georaphical area to analyse this further. Then we see that all countries have a high disproportion of suicide rate by males and females, so this trend looks quite universal.
To analyse the impact of GDP per capita we will first plot data of few selected countries to see if there are some common trends.
Plots below show development of suicide rate with time by country, development of GDP per capita with time and, GDP per capita vs suice rate.
countries <- c("Lithuania", "Russian Federation", "Republic of Korea", "Cuba", "Singapore", "Spain", "Argentina", "Turkey", "Qatar", "South Africa", "Switzerland")
countries_data <- data%>%filter((country %in% countries))
group_by(countries_data, year, country) %>% summarise(suicides_per_100k = sum(suicides_no)/sum(population)*100000) %>%
ggplot(aes(x=year, y=suicides_per_100k, col=country))+
geom_point()+
geom_line()+
facet_wrap(~country, scale='free_y')+
theme(legend.position = 'none')+
labs(y='Suicide per 100k')
group_by(countries_data, year, country) %>% summarise(gdp_per_capita = mean(gdp_per_capita)) %>%
ggplot(aes(x=year, y=gdp_per_capita, col=country))+
geom_line()+
geom_point()+
facet_wrap(~country, scale='free_y')+
theme(legend.position = 'none')+
labs(y='GDP per capita')
group_by(countries_data, year, country) %>% summarise(gdp_per_capita = mean(gdp_per_capita), suicides_per_100k = sum(suicides_no)/sum(population)*100000) %>%
ggplot(aes(x=gdp_per_capita, y=suicides_per_100k, col=country))+
geom_point()+
geom_smooth(method = 'loess', formula=y~x)+
facet_wrap(~country, scale='free')+
theme(legend.position = 'none')+
labs(x='GDP per capita', y='Suicide rate per 100k people')
So what we can see here is that different countries can have different trend in suicide rates (we can see increase, decrease and stagnation in suicide trends in this subset), which is quite expected. Then for all these countries gdp per capita in general seems to be growing with time. And when we plot gdp vs suicide rate it doesn’t look like there is uniform trend for this relation. But if we look at all these plots togather, we see that gdp increases with time and that there is a trend in suicide rates vs time. So it looks like gdp and time have somewhat linear relationship and gdp and basically all changes in suicide rate with gdp can be attributed to change in time.
Let’s see a global development of gdp per capita vs time:
data %>% group_by(year)%>%select(country, year, gdp_per_capita)%>%distinct()%>%summarise(gdp_per_capita=mean(gdp_per_capita))%>%
ggplot(aes(x=year, y=gdp_per_capita))+
geom_point()+
geom_smooth(method=lm)+
theme(legend.position = 'none')
So indeed we can see a strong linear relationship between global gdp per capita and time. This can tell us that gdp per capita doesn’t necessarily explane the development of suicide rate.
However, we can ask a different question. Is there a difference in suicide rate by wealth of a country.
Let’s have a look at how contries compare in suicide rate by their average gdp per capita. For this we will get the average GDP of a country and Average suicide rate (note that a point represents a country and we colour a point by continent):
group_by(data, country, continent) %>% summarise(suicide_rate = sum(suicides_no)/sum(population)*100000, avg_gdp=mean(gdp_per_capita))%>%
ggplot(aes(x=avg_gdp, y=suicide_rate, colour=continent))+
geom_point()+
labs(x='Average GDP per capita', y='Suicide rate per 100k', colour='Continent')
There is no obvious relation in this plot. There is quite high variability for countris with low gdp and it’s not clear how to split countries by income in this plot. What we can do instead is to add another catigorical variable by gdp which we will call Income Level and we will compare records by this grouping.
So, firstly let’s decide how we will split countries by income level. The idea is to split them by quartiles of average gdp per capita. Let’s have a look at box plot and quartile values:
gdp_by_country_year <- data %>% select(country, year, gdp_per_capita) %>% group_by(country,year) %>% distinct()
box_data <- boxplot(gdp_by_country_year$gdp_per_capita)
box_data$stats
## [,1]
## [1,] 251
## [2,] 3409
## [3,] 9448
## [4,] 25570
## [5,] 58531
We will pick the values of 1st, 2nd and 3nd quartiles as borders for income levels and we will divide income levels in these groups: Low, Lower-middle, Higher-middle and High.
So let’s augment the dataset and compare suicide rate by income level.
data_income_level <- data %>% mutate(income_level = case_when(gdp_per_capita < 3409 ~ 'Low',
gdp_per_capita>=3409 & gdp_per_capita < 9448 ~ 'Lower-middle',
gdp_per_capita>=9448 & gdp_per_capita <25570 ~ 'Higher-middle',
gdp_per_capita >=25570 ~ 'High')) %>%
mutate(income_level = factor(income_level, ordered=TRUE, levels = c('Low','Lower-middle','Higher-middle', 'High')))
group_by(data_income_level, year, income_level) %>% summarise(suicides_per_100k = sum(suicides_no)/sum(population)*100000) %>%
ggplot(aes(x=income_level, y=suicides_per_100k, fill=income_level))+
geom_boxplot()+
labs(x='Income Level', y='Suicides per 100k people')+
theme(legend.position = 'none')
Ok, we can again see that there is a high variability in low income group, but now we can clearly see that with growing income there is growing suicide rate (which is quite unexpected).
Let’s analise this category further and combine it with continent and age groups.
group_by(data_income_level, continent, year, income_level) %>% summarise(suicides_per_100k = sum(suicides_no)/sum(population)*100000, gdp_per_capita=mean(gdp_per_capita)) %>%
ggplot(aes(x = income_level, y=suicides_per_100k, fill=continent))+
geom_boxplot()+
labs(x='Income Level', y='Suicides per 100k')
In this plot the first thing we can see is that suicide rate in Europe among low income group is much higher than in other continents. And secondly that suicide rate in Europe is desreasing with growing income. Then it seems that in Asia suicide rate seems to be increasing with growing income and same for Americas. Let’s separate continents into different plots:
group_by(data_income_level, continent, year, income_level) %>% summarise(suicides_per_100k = sum(suicides_no)/sum(population)*100000, gdp_per_capita=mean(gdp_per_capita)) %>%
ggplot(aes(x = income_level, y=suicides_per_100k, fill=continent))+
geom_boxplot()+
facet_wrap(~continent, scales='free')+
theme(legend.position = 'none', axis.text.x = element_text(angle = 45, hjust = 1))+
labs(x='Income Level', y='Suicides per 100k')
So here we can clearly see that in Europe suicide rate decreases with growing income level and for Americas and Asia it increases.
Let’s now plot income level with age:
group_by(data_income_level, age, year, income_level) %>% summarise(suicides_per_100k = sum(suicides_no)/sum(population)*100000, gdp_per_capita=mean(gdp_per_capita)) %>%
ggplot(aes(x = income_level, y=suicides_per_100k, fill=age))+
geom_boxplot()+
labs(x='Income Level', y='Suicides per 100k')
Here we can see that proportion of suicide rate by different age groups stays roughly the same across different income groups.
I don’t think it makes sense to draw some conclusion from this variable, since people from different countries that belong to same same generation grew up in different socio economic environments. Also, this data has problems as in sharp decrases in population.
group_by(data, year, generation) %>% summarise(suicides_t_pop = sum(suicides_no)/sum(population)*100000) %>%
ggplot(aes(x=year, y=suicides_t_pop, colour=generation))+
facet_grid(generation ~ ., scales = 'free_y')+
geom_line()+
geom_point()+
scale_x_continuous(breaks = scales::pretty_breaks(n = x_year_ticks))+
labs(y='Suicides per 100k people')+
theme(legend.position = 'none')
group_by(data, year, generation) %>% summarise(population = sum(population)) %>%
ggplot(aes(x=year, y=population / 1000000, colour=generation))+
facet_grid(generation ~ ., scales = 'free_y')+
geom_line()+
geom_point()+
scale_x_continuous(breaks = scales::pretty_breaks(n = x_year_ticks))+
labs(y='Population (millions)')+
theme(legend.position = 'none')
Beside that, this value depends on age and we know from analysis above that age is a strong factor of suicide rate. So, in plots below we can see that the generations that were born earlier have higher suicide rate and they also belong to higher age group.
group_by(data, year, generation) %>% summarise(suicides_t_pop = sum(suicides_no)/sum(population)*100000) %>%
ggplot(aes(x=year, y=suicides_t_pop, colour=generation))+
geom_line()+
scale_x_continuous(breaks = scales::pretty_breaks(n = x_year_ticks))+
labs(y='Number of suicides per 100k people')
group_by(suicides_no_per_year_pop, year, age, generation) %>% summarise(suicides_t_pop = sum(suicides_per_pop)) %>%
ggplot(aes(x=year, y=suicides_t_pop, colour=generation))+
facet_grid(age ~ ., scales = "free_y")+
geom_line()+
geom_point()+
labs(y='Number of suicides per 100k people')
So in this exploratory analysis we found out several intresting pattern that affect suicide rate: